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JOINT RANK AND VARIABLE SELECTION FOR PARSIMONIOUS ESTIMATION 
IN HIGH-DIMENSION FINITE MIXTURE REGRESSION MODEL. 


EMILIE DEVIJVER 


Abstract. We study a dimension reduction method for finite mixture of multivariate response re¬ 
gression models in high dimension. Both the number of responses and of predictors may exceed the 
sample size. We consider jointly predictor selection and rank reduction for obtaining lower-dimensional 
approximations of parameter matrices. This methodology was already developed in [8], In this paper, 
we prove that these estimators are adaptive to the unknown matrix sparsity. More precisely, we exhibit 
a penalty for which the model selected by the penalized likelihood satisfies an oracle inequality We 
support our theoretical result with simulation study and data analysis. 

1. Introduction 

The multivariate response regression model 

Y = /3X + E 

postulates a linear relationship between Y, the q x n matrix containing q responses for n subjects, and 
X , the p x n matrix on p predictor variables. The term E is an q x n matrix with independent columns, 
Ei ~ N q ( 0, E) for all i e {1,..., n}. The unknown q x p coefficient matrix (3 needs to be estimate. In 
a more general way, we could use finite mixture of linear models, which model the relationship between 
response and predictors, arising from different subpopulations: if the variable Y. conditionally to X , 
belongs to the cluster k, there exists f3k and E*, such that Y = (3kX + E, with E ~ J\f q ( 0, E*,). 

If we use this model to deal with high-dinrensional data, the number of variables can be quickly much 
larger than the sample size, because and predictors and response variables could be high dimensional. 
To solve this problem, we will have to reduce the parameter dimension. 

One way to solve the dimension problem is to select relevant variables, in order to reduce the number 
of unknowns. Indeed, all the information should not be interesting for the clustering. In a density 
estimation way, we could cite Pan and Shen, in nn, who focus on mean variable selection, Zhou and 
Pan, in 201 , who use the Lasso estimator to regularize Gaussian mixture model with general covariance 
matrices, Sun and Wang, in be who propose to regularize the k-means algorithm to deal with high¬ 
dimensional data, Guo et al, in jlOj . who propose a pairwise variable selection method. 

In a regression framework, we could use the Lasso estimator, introduced by Tibshirani in |19] , which 
is a sparse estimator, by penalizing the maximum likelihood estimator by the G-norm, which achieves 
the sparsity, as the £o penalty, but leads also to a convex optimization. Because we work with the 
multivariate linear model, to deal with the matrix structure, we could prefer the group Lasso, variables 
grouped by columns, which selects columns rather than coefficients. This estimator was introduced by 
Zhou in [21] in the general case. If we select | J| columns among the p possible, we have to estimate \J\q 
coefficients rather than pq for A , which could be smaller than n if | J| is smaller enough. 

Another estimator which reduces the dimension, known in the linear model, is the low rank estimator: 
introduced by Izenman in HI], and more used the last decades, with among others Bunea et al. in g] 
and Giraud in [9j, the regression matrix could be estimated by matrix of rank r, r < p A q. Then, we 
have to estimate r(p + q — r) coefficients, which could be smaller than n. 

In this paper, we have chosen to mix these two estimators to provide a sparse and low rank estimator 
in mixture models. This method was introduced by Bunea et al. in gj, in the case of linear model and 
known noise covariance matrix. They present different ways, more or less computational, with more or 
less good results in theory. 

They get an oracle inequality, which say that, among a model collection, they are able to choose an 
estimator with good rank and good active variables. For this model, Ma et al. in [12] get a minimax 
lower bound, which precise that they attain nearly optimal rates of convergence adaptively for square 
Schatten norm losses. 
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In this paper, we consider finite mixture of K linear models in high-dimension. This model is studied 
in details by Stadler et al. article for real response variable in im> and by Devijver for multivariate 
response variable in . We will estimate j3k for all k £ {1 ,,K} by a column sparse and low rank 
estimator. The Lasso estimator is used to select variables, whereas we refit the estimation by a low rank 
estimator, restricted on relevant variables. We propose a procedure which is based on a modeling that 
recasts variable selection, rank selection, and clustering problems into a model selection problem. This 
procedure is developed in [8], with methodology, computational issues, simulations and data analysis. In 
this paper, we focus on theoretical point of view, and developed simulations and data analysis for the low 
rank issue. Our procedure constructs a model collection, with models more or less sparse, and with rank 
vector more or less small. Among this collection, we have to select a model. We use the slope heuristic, 
which is a non-asymptotic criterion. In a theoretical way, in this paper, we get an oracle inequality for 
the collection constructed by our procedure, which makes a comparison performance between our model 
and the oracle for a specified penalty. 

This result is an extension of the work of Bunea et al in j3], to mixture models and with covariance 
matrices ( Ek)i<k<K unknown. They ensure that mixing sparse estimator and low rank matrix could be 
interesting. Indeed, whereas we have to estimate px q coefficients in each cluster for the regression matrix, 
we get only r(| J| + q — r) unknown variables, which could be smaller than the number of observations 
n if | J\ and r are small. Even if the oracle inequality we get in this paper is an extension of Bunea 
et al. result, we use a really different way to prove it. Considering the model collection constructed, 
we want to select a model as good as possible. For that, we use the slope heuristic, which constructs 
a penalty, proportional to the dimension, and select the model minimizing the penalized log-likelihood. 
Theoretically, we construct also a penalty, proportional to the dimension (up to a logarithm term). We 
provide an oracle inequality which compare, up to a constant, the Jensen-Kullback-Leibler divergence 
of our model and the true model to the Kullback-Leibler divergence between the oracle and the true 
model. In estimation term, we do as well as possible. This oracle inequality is deduced from a general 
model selection theorem for maximum likelihood estimator of Massart in m- Controlling the bracketing 
entropy of models, we could prove the result. Remark that we work in a regression framework, then we 
rather use an extension of this theorem proved in Cohen and Le Pennec article [5], and that our model 
collection is random, constructed by the Lasso, then we rather use an extension of this theorem proved 
in [7j. To illustrate this procedure, in a computational way, we validate it on simulated dataset, and 
benchmark dataset. If the data have a low rank structure, we could easily find it with our methodology. 

This paper is organized as follows. In the Section [2J we describe the finite mixture regression model 
used in this procedure, and the main step of the procedure. In the Section [31 we present the main result 
of this paper, which is an oracle inequality for the procedure proposed. Finally, in Section [I] we illustrate 
the procedure on simulated and benchmark dataset. Proof details of the oracle inequality are given in 
Appendix. 


2. The model and the model collection 

We introduce our procedure of estimation by sparse and low rank matrix in the linear model, in 
Section 12.11 and extend it in Section [2.21 in mixture models. 

2.1. Linear model. We consider the observations ((a:*, Vi ) i = i , ...,«) which realized random variables 
(X, Y), satisfying the linear model 

Y = /3X + e 

where Y £ R 9 are the responses, X £ are the regressors, j3 £ R px? is an unknown matrix, and e £ R 9 
are random errors, e ~ N q (0,E), with E £ §+ + a symmetric positive definite matrix. We will work in 
high-dimension, then p x q could be larger than the number of observations n. 

We will construct an estimator which is sparse and low rank for /3 to cope with the high-dimension 
issue. Moreover, to reduce the covariance matrix dimension, we compute a diagonal estimator of E. The 
procedure we will propose could be explained into two steps. First, we estimate the active columns of /3 
thanks to the Lasso estimator, for A > 0, using the estimator 

Px aSS ° = argmin {| \Y - 0X\ || + A||P||i} ; 

pGRPXQ 

where ||/?||i = S 2 U 1 We assume that the variance is unknown. From this approach, we 

could rescale the parameters by E -1 = P t P the Cholesky decomposition, and cj> = P -1 /3, and then get 
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the estimates by, T q denoting the set of triangular matrices of size q , for A > 0, 

(1) ( ^LassO ; pLasso )= argmin {|\ PY - cj>X \|| + A| \<j>\\,} . 

(0,P)eR px,! xT, 

This approach was considered in Stadler et al. in m, in scalar response case. This reparametrization 
is done in order to get a scale-invariant estimator and a convex minimization problem. 

For A > 0, from the Lasso estimator of </>^ asso , we could deduce the relevant columns. 

Restricted to these relevant columns, in the second step of the procedure, we compute a low rank 
estimator of /3, saying of rank at most r. Indeed, as explained in Giraud in [5], we restrict the maxi¬ 
mum likelihood estimator to have a rank at most r, keeping only the r biggest singular values in the 
corresponding decomposition. We get an explicit formula. 

This two steps procedure leads to an estimator of p which is sparse and has a low rank. We have 
also reduced the dimension into two ways. We refit the covariance matrix estimator by the maximum 
likelihood estimator. 

This estimator is studied in Bunea et al. in 0], in method 3. Let extend it in mixture models. 


2.2. Mixture model. We observe n independent couples (x, y) = ((xi, yi)i<i< n ) of random variables 
(X, Y), with Y £ R 9 and X £ R p . The conditional density is assumed to be a multivariate Gaussian 
mixture regression model. If the observation i belongs to the cluster fc, we assume that there exists 
Pk £ R pX9 , and £*, £ §+ + such that yt = p r Xi + £j where e, ~ N q ( 0, £ r ). 

Thus, the random response variable Y £ R 9 depends on a set of explanatory variables, written X £ R p , 
through a mixture of linear regression-type model. Give more precisions on the assumptions. 

• The variables Yi are independent conditionally to Xi , for all i = 1,. .., n ; 

• we let Yi\Xi = Xi ~ s^(y\xi)dy, with 


K 


( 2 ) 


sdv I*) = 5Z' 




{y - Pkx) t Y k 1 {y - p k x) 


(27r)2det(£ fc ) 1 / 2 \ 2 

£ = (TTi, ... ,7T fe , P u ... ,p k , Er,..., Y k ) £ (n K x (R 9Xp ) K x (§++)*) 

n*r = |(tti, ..., n K )-, 7 Tfc > 0 for k £ {1,. .., K} and ^ = 1 1 

is the set of symmetric positive definite matrices on R 9 . 


We want to estimate the conditional density function s% from the observations. For all k £ {1,..., I\ }, Pk 
is the matrix of regression coefficients, and £& is the covariance matrix in the mixture component 
k. The 7TfeS are the mixture proportions. In fact, for all k £ {1 for all z £ {l,...,g}, 

\P\x\z = 'Y^j=\[Pk\j,zXj is the 2 th component of the mean of the mixture component k for the con¬ 
ditional density s^(.|x). 

We could introduce, for all k £ {1,K}, 

Zk 1 = PkPk 
4>k = P ^ 1 Pk - 


We could define an extension of the Lasso estimator, 

(3) (ftr°,P? UKO )= argmin [\\PY - + A^ n k \\M\i 

(c/>,P)GRpxixT q { k=1 

Remark that the penalty take into account the mixture weight. We could also define a low rank estima¬ 
tor (P\ R ,P\ R ) restricted to relevant variables detected by the Lasso estimator, for the regularization 
parameter A. 

In a computational way, we will use two generalized EM algorithms, in order to deal with high 
dimensional data and get a sparse and low rank estimator. Give some details about those algorithms. 

Initially, the EM algorithm was introduced by Dempster in [B]. It alternates two steps, an expectation 
step to cluster data, and a maximization step to update estimation. In our procedure, we want first to 
know which columns are relevant, then we want to compute ©• This algorithm was used and explained 
in P 5 . It is a generalization of the EM algorithm, for the Lasso estimator, and in a regression context. 
From the estimate </>^ asso , we could deduce which columns are relevant. The second algorithm we use 
lead to determine tfik on relevant columns, for all k £ K}, with rank Rk . We alternate two steps, 
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E-step and M-step, until relative convergence of the parameters and of the likelihood. We restrict the 
dataset to relevant columns, and construct an estimator of size | J\ x q rather than p x q. 

• E-step: compute for k £ {l,...,if}, i £ {1the expected value of the log likelihood 
function, 


7L/c — E^(ite) |E) 

<Pk 


Ej=i w 


where 


ipi = 7r[ Itc ^ det P L 


>(ite) 


exp 


-i P, 


(ite) 


Yi-Xi® 


(ite) 


)‘0 


► (ite) 


Yi-Xifr 


(ite) \ 


• M-step: 

— To get estimation in linear model, we assign each observation in its estimated cluster, by the 
MAP principle. We could apply this thanks to the E-step, which compute the a posteriori 

probability. Therefore, we say that yi comes from component number argmax 7 ^. 

H.{\ ./<} 

— Then, we can define fit} ^ = (x| A ,X| fc )~ 1 x| fc y| fc , in which and y^ are the sample restric¬ 
tion to the cluster k. We decompose /^ ltc) in singular values such that /3j[. lte ' ) = USV t with 
S = diag(s 1 ,..., s q ) and si > S 2 > ■ • ■ > s q the singular values. Then, the estimator /3^ ltc " ) 
is defined by = US r V 4 with S r = diag(si ,..., s r , 0,..., 0). 

To select the regularization parameter for the Lasso estimator and the ranks, we could use criterion as 
BIC or cross-validation. In practice, we prefer to construct a model collection with various active columns 
and various ranks, and select one as final step. To get various active columns, we construct a data-driven 
grid of regularization parameters, coming from EM algorithm formula. See [H] for more details. To get 
various ranks, we estimate parameters for different values of ranks, belonging to {r m i n ,... ,r max }. 

From this procedure, we construct a model with K clusters, |J| relevant columns and matrix of 
regression coefficients of ranks R £ as described by the next model S(k,j,r)- 


(4) 

where 


S( K ,j,r) = {V G K 9 |a: £ >->• s^k,j, R ) (y|x)} 


s^k.j.r) (y\x) 
AK,J,R) 
E K 
4 '(K,J,R) 


K 

E 

k= 1 


7 rfc det(Pfc) 
(2tt)9/ 2 


exp (y~\{v - Pk {k) *i J1 )* S fc 1( dJ - P, 


PWJJ] 


( 7 r 1 ,...,7rx,/3f (1) ,...,/3^ J ,S 1 ,...,E^)£S x 


R(K) 


Rk x ^(k,j,r) x (S q + ) K ; 

{(/3f (1) ,...,/3f W ) £ (R 9X I J I) A |Rank(/3fc) = R(k)}. 


Varying K £ K. C N*, J £ J C V([l ,p]), and R £ 1Z C {1,... ,p A q} K , we get a model collection with 
various number of components, active columns and matrix of regression coefficients. 

Among this model collection, during the last step, a model has to be selected. As in Maugis and 
Michel in |14| . and in Maugis and Meynet in [15] , among others, a non asymptotic penalized criterion 
is used. The slope heuristic was introduced by Birge and Massart in [3], and developed in practice by 
Maugis and Michel in [2] with the Capushe package. To use it in our context, we have to extend the 
theoretical result to determine the penalty shape in the high-dimensional context, with a random model 
collection, in a regression framework. The main result is described in the next section, whereas proof 
details are given in appendix. 


3. Oracle inequality 

In a theoretical point of view, we want to ensure that the slope heuristic which construct a penalty 
will select a good model. We follow the approach developed by Birge and Massart in [3] which consists of 
defining a non asymptotic penalized criterion, leading to an oracle inequality. In the context of regression, 
Cohen and Le Pennec, in [5], and Devijver in [7], propose a general model selection theorem for maximum 
likelihood estimation. The result we get is a theoretic penalty, for which the model selected is as good 
as the best one, according to the Kullback-Leibler loss. 
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3.1. Framework and model collection. Among the model collection constructed by the procedure 
developed in section 12.21 with various rank and various sparsity, we want to select an estimator which 
is close to the truth. The oracle is by definition the model belonging to the collection which minimizes 
the contrast with the true model. In practice, we do not have access to the true model, then we can 
not know the oracle. Nevertheless, the goal of the model selection step of our procedure is to be nearest 
to the oracle. In this section, we present an oracle inequality, which mean that if we have penalized 
the log-likelihood in a good way, we will select a model which is as good as the oracle, according to the 
Kullback-Leibler loss. 

We consider the model collection defined by 0. 

Because we work in high dimension, p could be big, and it will be time-consuming to test all the parts 
of {1,... ,p}. We construct a sub-collection denoted by J L , which is constructed by the Lasso, which is 
also random. This step is explained in more details in [8]. 

Moreover, to get the oracle inequality, we assume that the parameters are bounded: 

S(k,j,r) = { s (k,j,r) e f° r k G {1,..., A}, 

— dicig( 1 , 1 5 • • • i Pfc]q',q')> 
for all z G {l,...,g},a E < [£&]*,* < A s , 

R(k) 

for all k G [l,K],P*( k) = [ a k]l[u{}.,i[v k ]i,., 

1 = 1 

for all l G {1,..., R{k)}, [<Jk]i < A a }. 

Remark that this decomposition of /?& is the singular value decomposition, with {cri)i<i<ii(k) the singular 
values, and Uk and Vk unit vectors, for k G {1,..., K}. 

We also assume that covariates belong to an hypercube: without restrictions, we could assume that 
X G [0, If. 

Fixing K, the possible number of components, J L the active column set constructed by the Lasso, 
and 1Z the possible vector of ranks, we get a model collection 

( 5 ) U U U S ?k,j,r)- 

kgIC jgJ ReR 

3.2. Notations. Before enunciate the main theorem which leads to the oracle inequality, for the model 
collection © we need to define some metrics used to compare the conditional densities. First, the 
Kullback-Leibler divergence is defined by 

KL\(s, t) = J log sdX 

for s and t two densities, sdX absolutely continuous with respect to tdX. To deal with regression data, 
for fixed covariates (x ±,..., x n ), we define 

KLf"(s,t) =E 

for s and t two densities, sdX absolutely continuous with respect to tdX. 

We also define the Jensen-Kullback-Leibler divergence, first introduced in Cohen and Le Pennec [5], 
by 

JKL\ p (s, t) = -KL\(s, (1 - p)s + pt) 

P 

for p G]0,1[, s and t two densities, sdX absolutely continuous with respect to tdX. The tensorized one is 
defined by 

= E JKL^p(s(.\xi),t(.\xi))^ . 

Note that those divergences are not metrics, not satisfying the triangular inequality and no symmetric, 
but are also wild used in statistics to compare two densities. 
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3.3. Oracle inequality. Let state the main theorem. 


Theorem 3.1. Assume that we observe ((a;,, yi)i<i< n ) with unknown conditional density so- Let M = 
K,xJxlZ and A4 L = K,xJ L x 1Z, where J L is constructed by the Lasso estimator. Let S(k,j,r) £ Sf K j R ) 
such that, for 5kl > 0, 

KLf"(s 0 ,s (KtJ>R) )< inf + — 

teS (K,J,R) 12 

and such that there exists r > 0 such that 

(6) i) ^ e T *o. 

Consider the collection {s(k,j,r)}(K, j,R)eM of rank constrained log-likelihood minimizer in S^x,j,r)j 
satisfying 

( i n 

S(K,J,R) = argmin <-^ log ( s (K,j,R)(yi\xi)) 

s (K,J,R)€Sf K JR ' ) ^ ^ 2 = 1 

Denote by D(k,j,r) the dimension of the model S( KJR y Let pen : At —»• R+ defined by, for all 

C K,J,R) € M, 


pen(K,J,R) > k (K -‘ LR) {[2B 2 (Ap, As, a a , q) 
n 

( D ( 


-log ( 
+ log 


d2 


\ n 


B 2 {Ap, As:,a a ,q) A 1^ 


4 epq 


L ) (K,J,R)-q 2 A PQ 


H - R 


for k > 0 an absolute constant. 


Then, the estimator s 


(. K,J,R ) 


, with 


{K, J, R) = argmin l -V log(s (i q ( yi\xi )) + pen(I\, J , R) 

( k,j,r)gm l { n i=1 


satisfies 


(7) 


for C > 0. 


E “ lo §(S(A, j,R)(yi\ x i)) + pen(K, J, f ) 


2—1 


< inf V- lo ^(K,J,R){Vi\xi)) + pen(K, J, R) + p 
(K,J,R)eM \ J 


E(JKL®l(s 0 , S(kJ,r))) 


<Ce( inf ( inf KLf n (so, t) + pe "^’ J ’ + — 
\(k,j,r)£M l \t€S( K ,j,R) n J n 


The proof of the theorem l3.1l is given in Section[5] Note that condition (0 leads to control the random 
model collection. The mixture parameters are bounded in order to construct brackets over S(k,j,r), anc l 
thus to upper bound the entropy number. The inequality 0 we obtain is not exactly an oracle inequality, 
since the Jensen-Kullback-Leibler risk is upper bounded by the Kullback-Leibler bias. 

Because we are looking on a small random sub-collection of models, our estimator S(k,j,R) is attainable 
in practice. Moreover, it is a non-asymptotic result, which allows us to study cases for which p increases 
with n. 

Note that we use the Jensen-Kullback-Leibler divergence rather than the Kullback-Leibler divergence, 
because it is bounded. This boundedness turns out to be crucial to control the loss of the penalized max¬ 
imum likelihood estimator under mild assumptions on the complexity of the model and their collection. 

We could compare our bound with the one of Bunea et al, in [¥j, who computed a procedure similar to 
ours, in a linear model. According to consistent group selection for the group Lasso, they get adaptivity 
of the estimator to an optimal rate, and their estimators perform the bias variance trade-off among 
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Table 1. Performances of our procedure. Mean number {K, R, M, FA, ARI} of the 
Kullback-Leibler divergence between the model selected and the true model, the esti¬ 
mated rank of the model selected in each cluster, the missed variables, the false relevant 
variables, and the ARI, over 20 simulations. 



KL 

R 

M 

FA 

ARI 

p> n 

19.03 

[2.8,3] 

0 

20 

0.95 

p < n 

3.28 

[3,3] 

0 

0.6 

0.99 


all reduced rank estimators. Nevertheless, their results are obtained according to some assumption, 
for instance the mutual coherence on X f X, which postulates that the off-diagonal elements have to be 
small. Some assumptions on the design are required, whereas our result just need to deal with bounded 
parameters and bounded covariates. 


4. Numerical studies 


We will illustrate our procedure with simulations and real datasets, to highlight advantages of our 
method. We adapt the simulations part of Bunea et al. article, [1]. Indeed, we work in the same way, to 
get sparse and low rank estimator. Nevertheless, we add the mixture to be consistent with our clustering 
method, and to have more flexibility. 


4.1. Simulations. To illustrate our procedure, we use simulations adapted from the article of Bunea 
[ 3 ], extended for mixture models. 

The design matrix X has iid rows Xi from a multivariate normal distribution A(0,£) with £ = pi, 
p > 0. We consider a mixture with 2 components. Depending on the cluster, the coefficient matrix (3 k 
has the form 


b k B° 

0 


b k B 1 

0 


for k £ {1, 2}, with B° a J x R(k) matrix and B 1 a R(k) x q matrix. All entries in B° and B 1 are iid 
1V(0, 1). The noise matrix E has independent iV(0,1) entries. Let Ei denote its «th row. 

The proportion vector 7 r is defined by n = [i, 5 ], all clusters having the same probability. 

Each row 1} in Y is then generated as, if the observation i belongs to the cluster k , Yi = /3 k Xi + Ei, 
for 1 < i < n. This setup contains many noise features, but the relevant one lie in a low-dimensional 
subspace. We report two settings: 

• p> n: n = 50, | J| =6 ,p = 100, q = 10, R = [3,3], p = 0.1, b = [3, —3]. 

• p < n: n = 200, | J\ = 6,p = 10, q = 10, R = [3,3], p = 0.01, b = [3, —3]. 

The current setups show that variable selection, without taking the rank information into consideration, 
may be suboptimal, even if the correlations between predictors are low. Each model was simulated 20 
times, and table [1] summarizes our findings. We evaluate the prediction accuracy of each estimator /3 by 
the mean squared error (MSE) using a test sample at each run. We also report the median rank estimate 
(denoted by R) over all runs, rates of non included true variables (denoted by M for misses) and the 
rates of incorrectly included variables (FA for false actives). Ideally, we are looking for a model with 
low MSE, low M and low FA. 

We can draw the following conclusions from table [Tj When we work in low-dimensional framework, we 
get very good results. Even if we could use any estimator, because we do not have dimension problem, 
with our procedure we get the matrix structure involved by the model. We get almost exact clustering, 
and the Kullback-Leibler divergence between the model we construct and the true model is really low. 
In case of high-dimensional data, when p is larger than n, we get also good results. We find the good 
structure, selecting the relevant variables (our model will have false relevant variables, but no missed 
variables), and selecting the true ranks. We could remark that false relevant variables have low values. 
Comparing to another procedure which will not reduce the rank, we will perform the dimension reduction. 


4.2. Application. In this section, we apply our procedure to real data set. The Norwegian paper 
quality data were obtained from a controlled experiment that was carried out at a paper factory in 
Norway to uncover the effect of three control variables Xi,X 2 ,X% on the quality of the paper which 
was measured by 13 response variables. Each of the control variables A} takes values in {—1,0,1}. 
To account for possible interactions and nonlinear effects, second order terms were added to the set of 
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predictors, yielding Ai, A 2 , X 3 , , Xf, X|, AiX 2 , X 1 X 3 , A 2 X 3 , and the intercept term. There were 29 
observations with no missing values made on all response and predictor variables. The Box Behnken 
design of the experiment and the resulting data are described in Aldrin [T; and Izenman HU. Moreover, 
Bunea et al in j4] also study this dataset. We always center the responses and the predictors. The 
dataset clearly indicates that dimension reduction is possible, making it a typical application for reduced 
rank regression methods. Moreover, our method will exhibit different classes among this sample. 

We construct a model collection varying the number of clusters in K, = {2,..., 5}. We select a model 
with 2 classes. We select all variables except XiX 2 and X 2 X 3 , which is consistent with comments of 
Bunea et al. In the two classes, we get two mean matrices, with ranks equal to 2 and 4. One cluster 
describes the mean comportment (with rank equal to 2 ), whereas the other cluster contains values more 
different. 


5. Appendices 

In those appendices, we present the details of the proof of the theorem ld.il This derives from a general 
model selection theorem, enunciated in section I5T1 and proved in the paper t 7j. Then, the proof of the 
theorem I3J1 could be summarized by satisfying assumptions [lj [ 2 ] and [3] described in section [5TTI 


5.1. A general oracle inequality for model selection. Model selection appears with the AIC crite¬ 
rion and BIC criterion. In a non-asymptotic way, a theory was developed by Birge and Massart in [3]. 
With some assumptions that we will detail here, we get an oracle inequality for the maximum likelihood 
estimator among a model collection. Le Pennec and Cohen, in pjj, generalize this theorem in regression 
framework. We have to use a generalization of this theorem because we consider a random collection of 
models. 

Let us consider a model collection 

Before enunciate the general theorem, begin by talking about the assumptions. First, we impose a 
structural assumption. It is a bracketing entropy condition on the model S m with respect to the Hellinger 
divergence 


H 


= E 


n 

\ x i)A-\ x i)) 

n z ' 


i= 1 


A bracket [t ,t + ] is a pair of functions such that for all (x,y) £ X x y,t (y,x) < s(y\x) < t + (y,x). 
The bracketing entropy Hy ](d, S, df^ n ) of a set S is defined as the logarithm of the minimum number 
of brackets [t~,t + ] of width djf 1 {t~ ,t + ) smaller than 6 such that every functions of S belong to one of 
these brackets. 


Assumption 1 (H m ). There is a non-decreasing function (f> m such that 6 1 —> ^(f> m ( 8) is non-increasing 
on ( 0 , +00) and for every a £ R + and every s m £ S m , 

\JSm(s mi a),d% n )dS < </> m (cr); 

where S m (s m , a) = {t £ S m , djf (t, s m ) < a}. The model complexity V m is then defined as na ^ with a m 
the unique root of 

( 8 ) —<t> m (cr) = yjner. 

<7 

Denote that the model complexity depends on the bracketing entropies not of the global models S m 
but of the ones of smaller localized sets. This is a weaker assumption. 

For technical reason, a separability assumption is also required. 

Assumption 2 (Sep m ). There exists a countable subset S rn of S m and a set y m with X(y \y m ) = 0, 
for A the Lebesgue measure, such that for every t £ S m , there exists a sequence (tk)k> 1 of elements of 
S m such that for every x and every y £ y m , \og(tk(y\x)) goes to log(t(y|a:)) as k goes to infinity. 

We also need an information theory type assumption on our model collection. We assume the existence 
of a Kraft-type inequality for the collection. 

Assumption 3 (K). There is a family (x m ) me _M of non-negative numbers such that 

y; e~ xm < e < +00. 
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Then, we could write our main global theorem to get an oracle inequality in regression framework, 
with a random collection of models. 

Theorem 5.1. Assume we observe ((xi,yi)i<i< n ) with unknown conditional density s o- Let S = 
{Sm)m£M be at most countable collection of conditional density sets. Let assumption (K) holds while 
assumptions ( H m ) and ( Sep m ) hold for every models S m £ S. Let Skl > 0, and s m £ S m such that 

KLf"{a 0 ,s m ) < inf KLf"(a 0 ,t)+ —; 

teSm. n 


and let t > 0 such that 


(9) s m > e T s 0 . 

Introduce {S m ) rnG jO t some random sub-collection of Consider the collection (s m ) m£ jCi °f 

rj-log-likelihood minimizer in S m , satisfying 


E 


log(s m (yj|a;j)) < 



— log(s m (j/j|a:i)) +??. 


Then, for any p £ (0,1) and any C i > 1, there are two constants kq and C 2 depending only on p and 
C\ such that, as soon as for every index m £ M, 


(10) pen(in) > n(D m + (1 V r) x m ) 

with n > k 0 , and where the model complexity T> m is defined in (|5|). the penalized likelihood estimate 
§rh with rh £ M. such that 


E “ lo g(sm(2/i|*i)) + pen(m) < inf 

- , m£M 


[ - log(s m (yi\xi)) + pen(m) + rj 


satisfies 

(11) E(JKL®l(a 0 ,Sm))<CiE( inf inf KLf n (s 0 , t) + 2^^-) 

(12) +C2 (lvr)^ + ^. 

n n 

Remark 5.2. We get that, among a random model collection, we are able to choose one which is as 
good as the oracle, up to a constant C\, and some additive terms being around L. This result is non- 
asymptotic, and gives a theoretic penalty to select this model. 

Remark 5.3. The proof of this theorem is detailed in (7J. Nevertheless, we could give the main ideas 
to understand the assumptions. From assumptions [7] and we could use maximal inequalities which 
lead to, except on a set of probability less than e~ x C~ x ; for all x, a control of the ratio of the centered 
empirical process o/log(s m ') over the Hellinger distance between sq and s m i, this control being around L. 
Thanks to Bernstein inequality, satisfied according to the inequality ([7]), and thanks to the assumption 
0 we get the oracle inequality. 

Now, to prove theorem m, we have to satisfy assumptions ©, ©, and 0. 


5.2. Assumption ( H m ). 


5.2.1. Decomposition. As done in Cohen and Le Pennec (5], we could decompose the entropy by 
^[■]i e i^£K,j,R)i d% n ) — , dff 1 ) + 




where 


( 2 / e M 9 |x e W H> se(y|a;) = EfcLi 7Tr$(y|(/3f (A) a;)|j, £fc) 

j e = {tt!, ... ,7 tk,P? w , ■ • .,Pk { k) ^ 

[ ©fe = n A ' x ^(k,j,r) x ([as,^s]+*) K 
= {(/3f ( 1 ) ,...,/3f W ) € [-A^A^IRank^) = f?(fc)} 


life = < (7T1, ... , 7 Tk) e (0, 1) X ; y^TTfc = 1 


= {$(.|(/3 R X)|./,£);/3 g [a/j.^l^I.rank^) 

E = diag(£f,..., E^) G [a|, A|;] 9 } 

$ the Gaussian density. 


For the proportions, we get that 


H [ . ] (e,n K ,d%-) < log (k( 27re) K / 2 (^j 


Looking after the Gaussian entropy. 

5.2.2. For the Gaussian. We want to bound the integrated entropy. For that, first we have to construct 
some brackets to recover S m . Fix / G S m - We are looking for functions t~ and t + such that t~ < f < t + . 
Because / is a Gaussian, t~ and t + are dilatations of Gaussians. We then have to determine the mean, 
the variance and the dilatation coefficient of t~ and t + . We need the both following lemmas to construct 
these brackets. 

Lemme 5.4. Let < F(.|/ri, Ei) and $(.|^ 2 ,£ 2 ) be two Gaussian densities. If their variance matrices are 
assumed to be diagonal, with E a = diag([S%\ i, ■ • •, [5 a ]q) for a G {1, 2}, such that [S 2 ]^ > [Si]^ > 0 for 
all z G {1,..., q}, then for all x G R 9 , 

^(sl/^S 1 ) A VPA |(Ml-M 2 )G»a g ( |S2h f |Slh |E 2 | g I| Sl ] g ) (M 1 -M 2 ) 

$(*|M2,E 2 )^ll lV ^ p 


Lemme 5.5. The Hellinger distance of two Gaussian densities with diagonal variance matrices is given 
by the following expression: 


dim. |M1, Si), $(> 2 , S 2 )) = 2 - 2 ( n 

\ qi =i L^im \. Zj 2 \qi 


x exp < — — (/xi - H 2 Ydiag 


(mi - M 2 ) 


To get an e-bracket for the densities, we have to construct a 5-net for the variance and the mean, S 
to be precised later. 

• Step 1: construction of a net for the variance 

Let e G]0,1], and 5 = Let b? = (1 + For 2 < j < N, we have [as, As] = 

[ 6 jv, 6 jv-i] U • ■ • U[& 3 > 62 ]) where N is chosen to recover everything. We want that 

al = (l + Sy- N ^Al 

* 21 og^= (l-^]\og(l + 6) 


4 log (j^VT+S) 
log(l + 5) 


41 °s(-si-v / i+<5) 


We want N to be an integer, then N = | — log^i+a) — |. We get a regular net for the variance. 

We could let B = diag(6 ^ 1)5 ... ,6?^), close to E (and deterministic, independent of the values 
of E), where i is a permutation such that &j(z)+i < E z>z < bu z ) for all z G {1,..., q}. 



• Step 2: construction of a net for the mean vectors 

We use the singular decomposition of /?, /? = ^2iLi a i u \ v h with the singular values, 

and ( ui)i<i<r and (vi)i<i<r unit vectors. Those vectors are also bounded. 

We are looking for t~ and t + such that dn(t~,t + ) < e, and t~ < f < t + . We will use a 
dilatation of a Gaussian to construct such an e-bracket of $. 

We let 

t~(x,y) = (l + Sy^+^Qiyl^nx^l + Sr^B 1 ^) 
t + (x, y) = (1 + S)^ r+3 ^(y\uj, R x, (1 + S)B l ) 


where B l and B l+1 are constructed such that, for all z € {1,..., q}, B l +J- < £ ZjZ < B l z z (see step 1). 
The means vj.r & K . pX9 will be specified later. Just remark that J is the set of the relevant columns, 
and R the rank of vjr\ we will decompose vj.r = z!h viujvi, u g R |J|xfl , and v e R« XjR . 

We get 

t~{x,y) < f(y\x) < t + (x,y) 

if we have 

\\P X - vj,rx\\1 ^ “ 2 " 1/4 ). 

Remark that \\j3x — vj,rx'\\\ < p\\0 — fo/./illllMloo We need then 
(13) \\p-Vj,R\\l<pqR^al{l-2- 1/i ) 

According to 7], dn(t~ ,t + ) < 2 (p 2 qR + 3q/4) 2 5 2 , then, with 


V2[p 2 qr + 3/4g) 


we get the wanted bound. 

Now, explain how to construct vj.r to get m 


p q R 

||/3 — ^ J, H 112 = EEE I criuijVi, z - aui,jVi tZ | 2 

j= 1 2=1 1=1 
p q R 

= EEE || <Tl - O’ 1 1 \UjjV z ,l | - 5-1 UlJ - Uij\\v Zj i\ - <TlUlj\vi iZ - Vl,z 

j =1 z=l 1=1 

p q R 

< EEE ii°i ~ °i\ + A °\ fob - fo.ji+A tK,* - vi, z ii 2 

j = l 2 = 1 J=1 

< 2 pqr ( max \cn - fo | 2 + A CT max|iij j - uij | 2 + A CT max|i;* jZ - v* jZ | 2 

We need ||/3 — — 2 -1 / 4 ). 

If we choose 07 such that 

|cr; - C7*| < -^=a s \/l - 2 -1 / 4 , 

and foj such that 

\ U U - Uij I < as^l-2- 1 / 4 , 

and v* jZ such that 

c 

|«j,z - vi :Z | < a s \/l-2- 1 / 4 , 

v 12 Ac, 

then it works. 

To get this, we let, for [_-J begin the floor function, 















and 


a i = argmin 
ce S 




Vl2 


as 


V 7 ! - 2~ 1 I\ 


and 


U = zn 


A n 


-a s V 1-2-V 4 


L s/viA, 


and 


uij = argmin 
pel/ 


u l,j 


y/UA ( 


-as 


\/l - 2-V4 a 


and 


v = zn 


A n 




and 


Vij = argmin 
vev 


V lj 


6 

— 3 =—as 

Vl2A a 


\J 1 - 2~ 1 / 4 v 


Remark that we just need to determine vectors {(ui,j)i<j<j-i)i<i< R and ((fii, 2 )i<z< 9 -z)i<j<.R because 
those vectors are unit. 

Then, we let 


Vj G J c ,Wz G [1 ,p\, (; vj,n)j, z = 0 

R 

Vj G J,Vz G [1 ,p\, (Vj,R)j,z = T, °lUl,jVl,z 

1=1 


We have defined our bracket. 
We want to control the entropy. 


swflf- 


An 


R(k) 


An 


R(k)(\J\+q-R(k)-l) 


as 


< N 


k= 1 \ v^L2 A 

( AnVl2 


y/1 - 2-!/4 


\ \Zl2A, 

E^fl(fc)(|J|+9--R(fc)) 


-as 


\/l — 2~ 1 / 4 


SasV 1 - 2-V4 


< 4 _|_ 1 \ I ArVl 2 




as 2/ l asVl — 2 -1 / 4 


( 5-Ef = i«( fc )(l‘ 7 l+<?-«(fc))+i 


Then, 


y]( e A(K,j,R)AH) < C 




where 

(14) 


C = 4 (^ + I") f 

\ as 2/ y asVl — 2 

x ( V2(p 2 qR + 3/4q)) 


.T,Z =1 R(k)(\J\+q-R(k)) 

- 2-V4 J 

££=a,.-R(fc)(|J|+9-fl(k)) 


and D( K>JtR) = Y%= 1 R ( k )(\ J \ +Q~ R ( k ))- 


12 



5.2.3. For the mixture. We have to determine 4>(k,j,r) such that 
(15) 

Let compute the integral. 


J \JS( Kt j tR )(s( Kt j tR ),a),d < f] [ n )de < 

•al. 

J \JH[.] Ob S(k,j,r) (S(K,J,R) , o'), df[ n )de 
< a^/\og(C 2 ) + sJd {KJ jR) ^j log 


< JD 


\k,j,r)& 


log(C 2 ) 


D 


(. K,J,R ) 


/ log 


1 


a A 1 


with, according to m 


log(C 2 ) = log(4) + ^ log(27re) + (K - 1) log(3) + log (— + ^ 

Z \ as Z 

+ D (KJM loi(-jdl2=^ 

+ D(K,J,R) log ^ V2(p 2 qR + ^< 7 )^ + log(A') 

< d (k,j,r) ^log(4) + 1 ° g ^ 7re ^ + log 3^ 

+ d (k,j,r) (^log + log {{p 2 qR + ^q)V2 ) + 1 j 


( ( 24\/67re 

< 0 WJ , B) (losl Vl _ 2 _ 1/4 


log (p 2 qR+-q) 


log 


4b + i\ (A* 

as 2 J l as 


Then, 


J \J H[.]{e, S( K j tR ){s( K j tR ),(j), dfr)de 

<V- D («,«)(3 + \/ 1 »g(^ + ))(^ 


Consequently, by putting 


+ \/log(p 2 gi?+ -q) + J log 


1 


a A 1 


B = 3+ \jlog ( — + - ) ( — ) + \j\og{p 2 qR+ -q) 


as 2 J \ as 

we get that the function '&(k,j,r) defined on R+ by 


4\K,j,R)(a) - + ^log 


satisfies m- Besides, ^(k,j,r) is nondecreasing and a H > j s nonincreasing, then ^(k,j,r) is 

convenient. 

Finally, we need to find an upper bound of cr* satisfying 

'3 / (x,j,fl)(c r *) = Vnal. 

Consider cr* such that 41 (k,j,r)(&*) = \f&l- 
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This is equivalent to solve 


and then we could choose 


a* = \ — ( B + J log 


crl< — \2B 2 + log 


1 


cr* A 1 

1 


1 A —B 2 


5.3. Assumption ( K ). Let x^k,j,r) = log ^ 

could compute the sum 


_4 epq _ 

(D (KJ) -q 2 )/\pq 


j + maxfc g { li R{k). Then, we 


y g -X(K,J,R) — 

( K,J,R) 


IOS ((D K ,j-Aap5 ) ^ Yi 

yA> 0 1<| J\<pq J R 


-R 


The term A is controlled in proposition 4.5 in [7 by 2. Then, 

Y e -H K ,J,R) < 2 . 


( K,J,R ) 
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